Jamovi - Descriptive statistics I.
Problem 1
In this section, you will learn how to explore nominal data.
The comics.csv dataset has information on all comic characters that have been introduced by DC and Marvel. The variables:
- name - Name of the character - ID
- id - Personal identity of comic book characters - Nominal - Levels: “No Dual”, “Public”, “Secret”, “Unknown”
- align - Alignment of comic book characters - Nominal - Levels: “Bad”, “Good”, “Neutral”, “Reformed Criminals”
- eye - Eye colour - Nominal - Levels: “Amber Eyes”, …
- hair - Hair colour - Nominal - Levels: “Auburn Hair”, …
- gender - Gender - Nominal - Levels: “Female”, “Male”, “Other”
- gsm - Gender and sexuality minorities - Nominal - Levels: “Bisexual Characters”, …
- alive - Is the character alive or dead? - Nominal - Levels: “Deceased Characters”, “Living Characters”
- appearances - All appearances of the character - Continuous
- first_appear - First appearance of the character - Nominal - Levels: “Aug-62”, …
- publisher - Publisher Company - Nominal - Levels: “dc”, “marvel”
- Download comics.csv from here.
- Open the comics.csv dataset in jamovi.
- Check the measurement type of the variables and provide a description for each variable.
- Determine the number of rows and columns. Determine the number of observations and variables.
We have 23272 observations, or cases, or rows and 11 variables, or columns.
- Two publishers, Marvel and DC, have created a host of superheroes that have made their way into popular culture. You’re probably familiar with Batman and Spiderman, but what about Mor the Mighty or Samuel Guthrie? Learn more about these two characters from the database.
[Figure: filter condition using or, with the Hide filter columns option]
During filtering, use the filter condition above and select the Hide filter columns option, which also hides the filter columns and unnecessary rows. As we can see, both characters are from the Marvel universe and both are male.
After the examination, do not forget to deactivate the filter.
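The same kind of or-filter can be sketched outside jamovi. The rows below are illustrative stand-ins for comics.csv (the third row is invented, and the exact spelling of names in the real file may differ); only the two character names, their publisher and their gender come from the text above.

```python
# Toy rows standing in for comics.csv; values are illustrative only.
rows = [
    {"name": "Mor the Mighty", "publisher": "marvel", "gender": "Male"},
    {"name": "Samuel Guthrie", "publisher": "marvel", "gender": "Male"},
    {"name": "Some Other Hero", "publisher": "dc", "gender": "Female"},  # hypothetical
]


def matches(row):
    # A row is kept if EITHER condition holds (logical "or").
    return row["name"] == "Mor the Mighty" or row["name"] == "Samuel Guthrie"


# Keep only the matching rows, like an active filter hiding the rest.
selected = [row for row in rows if matches(row)]

for row in selected:
    print(row["name"], row["publisher"], row["gender"])
```

Using and instead of or here would return no rows, because a single name cannot equal both values at once.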
- Check the number of valid and missing values for the variables align, gender and id.
Select the Analyses / Exploration / Descriptives menu. Move the 3 variables into the ‘Variables’ box.
The first output of a descriptive statistical analysis is the “missing data report”. Jamovi automatically recognises blank fields as missing values; no further action is needed unless you have other values that should be considered missing. The output above shows that the number of missing values is high. The number of valid (non-missing) values in the align variable is 19859, while the number of missing values is 3413.
Often, participants don’t answer every question on a questionnaire, observers miss a particular behavior, recording devices fail, a program has a glitch, or a participant’s results are lost. These kinds of situations create missing values. (When entering your data, it is usually best to leave the cell blank where a missing value would go. If instead you put in some number like 999 or –1, it could end up being included in later calculations!)
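To see why a numeric code such as 999 is risky, here is a minimal sketch with made-up questionnaire answers. It counts valid and missing values, and then shows what happens to the mean when blanks are coded as 999 without being declared missing. The last line only rechecks the align counts reported above.

```python
from statistics import mean

# Hypothetical answers to one item; None marks a blank (missing) field.
answers = [4, 5, None, 3, None, 4]

valid = [a for a in answers if a is not None]
n_valid = len(valid)
n_missing = len(answers) - n_valid

# The same data with blanks coded as 999 and NOT declared as missing.
coded = [4, 5, 999, 3, 999, 4]

mean_valid = mean(valid)  # uses only the four real answers
mean_coded = mean(coded)  # the two 999 codes are treated as real answers

print(n_valid, n_missing, mean_valid, round(mean_coded, 2))

# Consistency check for align: valid + missing = number of rows.
total_rows = 19859 + 3413
```

The valid answers average 4, while the coded version averages well over 300, which is obviously not a plausible answer on this scale.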
- It’s clear that the alignment variable (align) can be “good” or “neutral”, but what other values are possible?
The following labels (levels) are expected to appear in the descriptive statistics analysis:
If we run Setup variable, we learn that there are in fact four possible alignments, including Reformed Criminals.
Factor is another way of referring to a categorical variable. (As you know, there are two kinds of categorical data: Nominal Data, Ordinal Data.) Factor levels are all of the values that the factor can take (recall that a categorical variable has a set number of groups).
- How many times does each value occur in the align variable?
The usual way to determine the occurrence of each level is to display a frequency table. Select Analyses / Exploration / Descriptives menu. Move the align variable into the ‘Variables’ box, then switch on the Frequency tables check box.
The frequency table indicates that Bad and Good are the most common alignments. Neutral characters are rarer, and only three individuals are Reformed Criminals. Frequency tables can also give you information about modes for nominal and ordinal variables, proportions, and cumulative percentages that can be used to extract information about percentiles and percentile ranks. For example, 37.6% of the characters in the sample are good. For nominal variables, there is no natural order of levels, so we do not interpret the cumulative frequency.
A frequency table lists a set of values and how often each one appears. Frequency is the number of times a specific data value occurs in a variable. These tables help you understand which data values are common and which are rare. These tables organize your data and are an effective way to present the results to others. Frequency tables are also known as frequency distributions because they allow you to understand the distribution of values in a variable.
Frequency distribution tables are a great way to find the mode for a variable.
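As a rough illustration of what a frequency table contains, the sketch below builds one by hand for a short, made-up list of alignment values. Computing the percentages relative to the valid (non-missing) values is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical values of a nominal variable; None marks a missing value.
align = ["Bad", "Good", "Bad", None, "Neutral", "Good", "Bad", "Good", "Bad", None]

valid = [v for v in align if v is not None]
counts = Counter(valid)
n_valid = len(valid)

# One row per level: count and percentage of the valid values.
table = {level: (count, 100 * count / n_valid) for level, count in counts.items()}
for level, (count, pct) in sorted(table.items()):
    print(f"{level:10s} {count:3d} {pct:6.1f}%")

# The mode is the level with the highest count.
mode_level, mode_count = counts.most_common(1)[0]
print("mode:", mode_level)
```

Reading the mode off the table is simply a matter of finding the row with the largest count.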
- Do the same (points 5-7) for identity (the id variable), after selecting the Analyses / Exploration / Descriptives menu.
We learn that there are four possible identities.
- Save the file comics.omv and send it to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 1.
Problem 2
In this section, you will learn how to explore ordinal data.
The Student Survey Data looks at variables from a survey of over 200 students at the University of Adelaide. It can be downloaded from this link. You can find the codebook here.
- Open the survey.csv dataset in jamovi.
- Check the measurement type of the variables and provide a description for each variable.
- Determine the number of rows and columns. Determine the number of observations and variables.
We have 237 observations, or cases, or rows and 13 variables, or columns.
- Check again that the levels of the ordinal variables (Exer, Smoke) follow each other in increasing order of magnitude.
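Why the order needs checking can be sketched as follows: software that sorts level labels as text puts them in alphabetical order, which rarely matches their meaning. The labels "None", "Some" and "Freq" are assumed here for illustration; check the actual labels in your data.

```python
# Assumed level labels, listed explicitly from lowest to highest.
exer_order = ["None", "Some", "Freq"]
rank = {level: i for i, level in enumerate(exer_order)}

# Hypothetical observed values.
observed = ["Freq", "None", "Some", "Freq", "Some"]

alphabetical = sorted(set(observed))                # ordered by label text
by_magnitude = sorted(set(observed), key=rank.get)  # ordered by meaning

print(alphabetical)
print(by_magnitude)
```

Cumulative frequencies are only meaningful with the second ordering, which is why the levels are worth checking before interpreting the table.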
[Figures: Exer factor; Smoke factor]
- Check the number of valid and missing values for the ordinal variables Exer and Smoke.
Select the Analyses / Exploration / Descriptives menu. Move the 2 variables into the ‘Variables’ box.
There are no missing values in the variable Exer, so we know how much exercise everyone does. The variable Smoke contains one missing value, i.e. the number of valid values (236) is one less than the number of sample elements.
- Give a frequency table for the two ordinal variables and interpret them. Note that in the case of ordinal variables, the cumulative frequency can also be interpreted.
[Figure: frequency table of the Exer variable]
Regarding the variable Exer, students who exercise frequently form the largest group: 115 students, or 48.52% of the total sample. Fewer students do some physical activity (98 students, 41.35% of the sample), and the smallest group does no physical activity at all: 24 students, or 10.13% of the sample. Looking at the cumulative frequencies, for example, 51.48% of the sample does at most some physical activity.
[Figure: frequency table of the Smoke variable]
After reviewing the frequency table of the variable Smoke, we can say that heavy smokers form the smallest group, accounting for 4.66% of the sample (only 11 students). Regular smokers make up 7.2% (17 students), occasional smokers 8.05% (19 students), and the largest group (189 students, 80.08% of the sample) consists of non-smokers. The cumulative frequencies indicate that 88.14% of the sample are at most occasional smokers.
The cumulative frequency of a level is the frequency of that level plus the frequencies of all the levels below it. For example, the cumulative frequency of occasional smokers adds the non-smokers to the occasional smokers.
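The cumulative percentages for Smoke can be recomputed from the counts reported above. This is only a sketch of the arithmetic; the level labels are descriptive names chosen for illustration, not necessarily those used in the dataset.

```python
from itertools import accumulate

# Levels from lowest to highest, with the counts reported above.
levels = ["Never", "Occasional", "Regular", "Heavy"]  # labels are illustrative
counts = [189, 19, 17, 11]

n_valid = sum(counts)

# Running total: each entry adds the current level to all lower levels.
cumulative = list(accumulate(counts))
cumulative_pct = [round(100 * c / n_valid, 2) for c in cumulative]

for level, count, cum, pct in zip(levels, counts, cumulative, cumulative_pct):
    print(f"{level:11s} {count:4d} {cum:4d} {pct:7.2f}%")
```

The second row reproduces the 88.14% quoted for students who are at most occasional smokers, and the last row always reaches 100%.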
- Save the file survey.omv and send it to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 2.
Problem 3
In this section, you will learn how to explore numerical data.
We will use data from gapminder, which tracks demographic data in countries of the world over time. It can be downloaded from this link. You can find the codebook here.
For this problem, focus on how the life expectancy differs from continent to continent.
- Open the gapminder.csv dataset in jamovi. Delete the variable rownames.
- Check the measurement type of the variables and provide a description for each variable.
- Determine the number of rows and columns. Determine the number of observations and variables.
We have 1704 observations, or cases, or rows and 6 variables, or columns.
- Learn more about all continuous variables. Determine the number of valid and missing values for these variables.
We can see that there are no missing values in the continuous variables.
As you can see in the figure above (Figure 2), unlike the nominal and ordinal variables, additional rows of the table are also filled in under N (valid, non-missing values) and Missing (missing values). The meaning of the lines appearing in the table:
- Mean - The sample mean is a statistic obtained by calculating the arithmetic average of the values of a variable in a sample.
- Median - A statistic that is also used to indicate the center of a data set but that is not affected by extreme values is the sample median, defined as the middle value when the data are ranked in order from smallest to largest.
- Standard deviation - A standard deviation is a measure of how dispersed the data is in relation to the sample mean.
- Minimum - The least or smallest value in a variable.
- Maximum - The greatest value in a variable.
We can see that the data recorded in the database comes from the period 1952-2007 (based on the minimum and maximum of the variable year).
What is the typical life expectancy? The most common answer is the mean (59.47), which is the sum of all of the observations divided by the number of observations. Another measure of “typical” or “center” is the median. The median is the middle value in the sorted dataset. So if we sort lifeExp, the middle value is 60.71. A third measure of center is the mode. The mode is simply the most common observation in the set.
- Turn on and interpret additional metrics.
[Figure: Statistics tab]
More metrics can be switched on in the Statistics tab:
- Mode - The mode is simply the most common observation in the set.
- IQR - The interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR contains the middle 50% of the sample. It is defined as the difference between the 75th and 25th percentiles of the data.
- Range - The range is the difference between the lowest and highest values.
- Skewness - Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
- Kurtosis - Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
- 25th percentile - Also known as the first, or lower, quartile. The 25th percentile is the value at which 25% of the answers lie below that value, and 75% of the answers lie above that value.
- 50th percentile - The 50th percentile or median is the value in the very middle of a set of measurements. In other words, 50% of measurements are under the median and 50% are over the median.
- 75th percentile - Also known as the third, or upper, quartile. The 75th percentile is the value at which 25% of the answers lie above that value and 75% of the answers lie below that value.
A third measure of the typical value of lifeExp is the mode, which is 69.39.
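The percentile-based metrics described above can be sketched on a small, made-up sample. Note that several conventions exist for interpolating percentiles; the "inclusive" method used here is one of them, and jamovi's values may differ slightly on real data.

```python
from statistics import quantiles

# Hypothetical sample of nine values.
values = [2, 4, 4, 5, 7, 8, 9, 11, 13]

# n=4 cuts the sorted data into quarters and returns Q1, the median and Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                            # spread of the middle 50%
value_range = max(values) - min(values)  # spread of the whole sample

print(q1, q2, q3, iqr, value_range)
```

Here the range (11) is more than twice the IQR (5), because the range depends only on the two most extreme values.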
- Save the file gapminder.omv and send it to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 3.
It is not easy to answer the question of what a typical value is in a variable. The mean can be thought of as the balance point of the data and it tends to be drawn towards the longer tail of a distribution. This highlights an important feature of the mean: its sensitivity to extreme values. For this reason, when working with skewed distributions, the median is often a more appropriate measure of center.
Central tendency measures are statistical methods used to describe the center or typical value in a set of data. There are three primary measures of central tendency: the mean, the median, and the mode. Each of these measures provides a different way of summarizing the center of a dataset. Here’s a characterization of each:
Mean
- The mean, often referred to as the average, is calculated by summing up all the values in a dataset and dividing by the number of data points.
- Formula: Mean = (Sum of all data points) / (Number of data points).
- The mean is sensitive to extreme values or outliers in the dataset. It can be heavily influenced by large or small values, which may not represent the “typical” value when outliers are present.
Median
- The median is the middle value in a dataset when the values are arranged in ascending or descending order. If there is an even number of data points, the median is the average of the two middle values.
- The median is not influenced by extreme values, making it a robust measure of central tendency. It represents the middle value, which is often a better choice when outliers are present.
Mode
- The mode is the most frequently occurring value in a dataset. A dataset may have no mode (if all values occur equally), one mode (if one value occurs most frequently), or multiple modes (if more than one value occurs with the same highest frequency).
- The mode is particularly useful for categorical or nominal data, where values represent categories rather than numerical quantities. In such cases, the mode describes the most common category.
These three measures are complementary and can provide different insights into a dataset. The choice of which measure to use depends on the nature of the data and the specific research or analytical goals. It’s common to use all three measures in conjunction to gain a more complete understanding of the central tendency in a dataset.
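A small made-up example shows how differently the three measures react to a single extreme value. The numbers are illustrative only.

```python
from statistics import mean, median, multimode

# Hypothetical sample, then the same sample with one extreme value added.
sample = [3, 4, 4, 5, 6]
with_outlier = sample + [50]

# multimode returns a list, so ties between several modes are handled too.
summary = {
    "sample": (mean(sample), median(sample), multimode(sample)),
    "with_outlier": (
        mean(with_outlier),
        median(with_outlier),
        multimode(with_outlier),
    ),
}

for name, (m, med, modes) in summary.items():
    print(f"{name:13s} mean={m} median={med} mode={modes}")
```

Adding the value 50 moves the mean from 4.4 to 12, while the median shifts only from 4 to 4.5 and the mode stays at 4.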
Problem 4
In this section, you will learn how to explore numerical data split by categorical data.
The problem also uses data from gapminder, which tracks demographics in countries around the world over time. For this problem, we will focus on how life expectancy differs from continent to continent. This requires that you conduct your analysis at the continent level.
- Open the gapminder.omv dataset from Problem 3 in jamovi.
- Filter the dataset so that it contains only the 2007 data.
- Calculate the mean and median life expectancy for each continent.
[Figures: lifeExp split by continent]
- If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation. For each continent, summarize life expectancies using standard deviation and IQR.
[Figures: lifeExp split by continent]
- Save the file gapminder.omv and send it to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 4.
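The logic of this problem (filter to one year, then summarize within each group) can be sketched as follows. The rows are invented and the group names "A" and "B" are placeholders, so none of the numbers correspond to the real gapminder data.

```python
from collections import defaultdict
from statistics import mean, median, quantiles, stdev

# Hypothetical (continent, year, lifeExp) rows; NOT real gapminder values.
rows = [
    ("A", 2002, 49.0), ("B", 2002, 75.0),
    ("A", 2007, 50.0), ("A", 2007, 54.0), ("A", 2007, 58.0), ("A", 2007, 70.0),
    ("B", 2007, 76.0), ("B", 2007, 78.0), ("B", 2007, 80.0), ("B", 2007, 82.0),
]

# Step 1: keep only the 2007 rows, grouped by continent.
groups = defaultdict(list)
for continent, year, life_exp in rows:
    if year == 2007:
        groups[continent].append(life_exp)

# Step 2: compute the same statistics separately within each group.
summary = {}
for continent, values in groups.items():
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    summary[continent] = {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }

for continent, stats in summary.items():
    print(continent, {k: round(v, 2) for k, v in stats.items()})
```

In group A the mean (58) sits above the median (56) because one value is much higher than the rest, which is the kind of asymmetry that makes the median and IQR worth reporting alongside the mean and standard deviation.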
Spread measures in statistics are used to describe the extent to which data points in a dataset vary or are dispersed from the central tendency measures (mean, median, mode). Spread measures provide insights into the distribution of the data, how widely it is distributed, and the degree of variability within the dataset. There are several common spread measures, and here are a few of the most important ones:
Range
- The range is the simplest spread measure and is calculated as the difference between the maximum and minimum values in a dataset.
- Formula: Range = Maximum Value - Minimum Value.
- While the range is easy to compute, it can be sensitive to outliers and might not provide a robust measure of spread, especially when dealing with skewed or irregular distributions.
Interquartile Range (IQR)
- The IQR is a more robust measure of spread that is based on quartiles. Quartiles divide the data into four equal parts, and the IQR is the range of the middle 50% of the data.
- Formula: IQR = Q3 (third quartile) - Q1 (first quartile).
- The IQR is less affected by extreme values and is particularly useful for skewed datasets.
Variance and Standard Deviation
- Variance measures the average squared deviation of data points from the mean. Standard deviation is the square root of variance and is expressed in the same units as the original data.
- Formulas: Variance \(s^2 = \sum (x - \bar{x})^2 / (n - 1)\), Standard Deviation \(s = \sqrt{s^2}\).
- Variance and standard deviation provide a more detailed assessment of the spread and are sensitive to the magnitude of deviations from the mean.
The choice of which spread measure to use depends on the nature of the data, the research question, and the presence of outliers. Each spread measure provides different information about the data’s variability, and the appropriate measure to use may vary from one situation to another.
Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
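The following sketch applies the sample variance formula directly and then contrasts the standard deviation with the IQR when one value becomes extreme. The data are made up, and the IQR uses the "inclusive" percentile convention, which is only one of several in use.

```python
from math import sqrt
from statistics import quantiles, stdev


def sample_variance(values):
    # s^2 = sum((x - mean)^2) / (n - 1)
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def iqr(values):
    # Difference between the third and first quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 3, 5, 7, 9]      # hypothetical values
shifted = [1, 3, 5, 7, 84]  # same data with the largest value made extreme

s2 = sample_variance(data)
s = sqrt(s2)

print("variance:", s2, "sd:", round(s, 3), "iqr:", iqr(data))
print("with outlier -> sd:", round(stdev(shifted), 3), "iqr:", iqr(shifted))
```

The single extreme value inflates the variance from 10 to 1285, while the IQR stays at 4 in both cases.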
Problem 5
In this section, you will learn how to apply your knowledge of descriptive statistics to categorical and numerical variables.
You will use data from birthwt dataset. The data were collected at Baystate Medical Center, Springfield, Mass during 1986. It can be downloaded from this link. You can find the codebook here.
Formulate 4 questions about the birthwt database. Answer them following the approach of Problems 1-4 in this Lab.
The questions and the text answers, as well as the jamovi screenshots, are summarised in a birthwt.html file.
Send your birthwt.html and birthwt.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 5. The better the solution, the more badges you get (from four to zero).